Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk
Authors
Abstract
This paper discusses a machine translation evaluation task conducted using Amazon Mechanical Turk. We present a translation adequacy assessment task for untrained Arabic-speaking annotators and discuss several techniques for normalizing the resulting data. We present a novel two-stage normalization technique shown to have the best performance on this task, and we further discuss the results of all techniques and the usability of the resulting adequacy scores.
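The abstract does not spell out the two-stage procedure. As a rough illustration of the kind of normalization involved, the sketch below applies a common approach from the MT-evaluation literature: per-annotator z-score standardization (stage 1) to remove each worker's bias in location and spread, followed by a linear rescaling onto the original adequacy range (stage 2). The function names, the 1-5 scale, and the data layout are illustrative assumptions, not the authors' actual method.

```python
# A minimal sketch of one plausible two-stage normalization for crowd-sourced
# adequacy scores. This is an assumed, generic procedure; the paper's actual
# technique is not reproduced here.
from collections import defaultdict
from statistics import mean, pstdev

def zscore_by_annotator(judgments):
    """Stage 1: standardize each annotator's scores.

    judgments: list of (score, annotator, item) tuples.
    """
    by_worker = defaultdict(list)
    for score, worker, _item in judgments:
        by_worker[worker].append(score)
    # Per-worker mean and std; guard against zero std for constant raters.
    stats = {w: (mean(s), pstdev(s) or 1.0) for w, s in by_worker.items()}
    return [((score - stats[w][0]) / stats[w][1], w, item)
            for score, w, item in judgments]

def rescale(judgments, lo=1.0, hi=5.0):
    """Stage 2: map standardized scores linearly back onto the original scale."""
    zs = [z for z, _, _ in judgments]
    zmin, zmax = min(zs), max(zs)
    span = (zmax - zmin) or 1.0
    return [(lo + (z - zmin) / span * (hi - lo), w, item)
            for z, w, item in judgments]

# Example: three hypothetical workers rating two segments on a 1-5 scale.
raw = [(4, "w1", "seg1"), (2, "w1", "seg2"),
       (5, "w2", "seg1"), (3, "w2", "seg2"),
       (3, "w3", "seg1"), (1, "w3", "seg2")]
normalized = rescale(zscore_by_annotator(raw))
for score, worker, item in normalized:
    print(f"{item} {worker}: {score:.2f}")
```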
Similar Resources
Crowd-Sourcing of Human Judgments of Machine Translation Fluency
Human evaluation of machine translation quality is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment. However, achievement of consistent human judgments of machine translation is not easy, with decreasing levels of consistency reported in annual evaluation campaigns. In this paper we describe experiences g...
Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk
Manual evaluation of translation quality is generally thought to be excessively time-consuming and expensive. We explore a fast and inexpensive way of doing it using Amazon's Mechanical Turk to pay small sums to a large number of non-expert annotators. For $10 we redundantly recreate judgments from a WMT08 translation task. We find that when combined, non-expert judgments have a high level of ag...
Continuous Measurement Scales in Human Evaluation of Machine Translation
We explore the use of continuous rating scales for human evaluation in the context of machine translation evaluation, comparing two assessor-intrinsic quality-control techniques that do not rely on agreement with expert judgments. Experiments employing Amazon's Mechanical Turk service show that quality-control techniques made possible by the use of the continuous scale show dramatic improvements...
Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation
This paper presents the results of the WMT10 and MetricsMATR10 shared tasks, which included a translation task, a system combination task, and an evaluation task. We conducted a large-scale manual evaluation of 104 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of trans...
Crowdsourcing Music Similarity Judgments using Mechanical Turk
Collecting human judgments for music similarity evaluation has always been a difficult and time-consuming task. This paper explores the viability of Amazon Mechanical Turk (MTurk) for collecting human judgments for audio music similarity evaluation tasks. We compared the similarity judgments collected from Evalutron6000 (E6K) and MTurk using the Music Information Retrieval Evaluation eXchange 2...
Journal:
Volume / Issue:
Pages: -
Publication year: 2010